The data exploration part is based on code from https://www.kaggle.com/code/muhammadfaizan65/machine-failure-prediction-eda-modeling, with new models and comparisons added. The data can be found at https://www.kaggle.com/datasets/umerrtx/machine-failure-prediction-using-sensor-data?resource=download.

Dataset Overview

This dataset contains sensor readings collected from various machines, with the goal of predicting machine failures in advance. It includes a variety of sensor measurements as well as recorded machine failures.

Column Descriptions

footfall: The number of people or objects passing by the machine.
tempMode: The temperature mode or setting of the machine.
AQ: Air quality index near the machine.
USS: Ultrasonic sensor data, indicating proximity measurements.
CS: Current sensor readings, indicating the electrical current usage of the machine.
VOC: Volatile organic compounds level detected near the machine.
RP: Rotational position or RPM (revolutions per minute) of the machine parts.
IP: Input pressure to the machine.
Temperature: The operating temperature of the machine.
fail: Binary indicator of machine failure (1 for failure, 0 for no failure).

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, accuracy_score

# Deep learning and gradient boosting libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
import xgboost as xgb
In [2]:
# Load the dataset
file_path = "data.csv"
data = pd.read_csv(file_path)
In [3]:
# Display basic info and summary
print(data.info())
print(data.describe())
print(data.shape)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 944 entries, 0 to 943
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   footfall     944 non-null    int64
 1   tempMode     944 non-null    int64
 2   AQ           944 non-null    int64
 3   USS          944 non-null    int64
 4   CS           944 non-null    int64
 5   VOC          944 non-null    int64
 6   RP           944 non-null    int64
 7   IP           944 non-null    int64
 8   Temperature  944 non-null    int64
 9   fail         944 non-null    int64
dtypes: int64(10)
memory usage: 73.9 KB
None
          footfall    tempMode          AQ         USS          CS  \
count   944.000000  944.000000  944.000000  944.000000  944.000000   
mean    306.381356    3.727754    4.325212    2.939619    5.394068   
std    1082.606745    2.677235    1.438436    1.383725    1.269349   
min       0.000000    0.000000    1.000000    1.000000    1.000000   
25%       1.000000    1.000000    3.000000    2.000000    5.000000   
50%      22.000000    3.000000    4.000000    3.000000    6.000000   
75%     110.000000    7.000000    6.000000    4.000000    6.000000   
max    7300.000000    7.000000    7.000000    7.000000    7.000000   

              VOC          RP          IP  Temperature        fail  
count  944.000000  944.000000  944.000000   944.000000  944.000000  
mean     2.842161   47.043432    4.565678    16.331568    0.416314  
std      2.273337   16.423130    1.599287     5.974781    0.493208  
min      0.000000   19.000000    1.000000     1.000000    0.000000  
25%      1.000000   34.000000    3.000000    14.000000    0.000000  
50%      2.000000   44.000000    4.000000    17.000000    0.000000  
75%      5.000000   58.000000    6.000000    21.000000    1.000000  
max      6.000000   91.000000    7.000000    24.000000    1.000000  
(944, 10)
In [4]:
# Check for missing values
print(data.isnull().sum())
footfall       0
tempMode       0
AQ             0
USS            0
CS             0
VOC            0
RP             0
IP             0
Temperature    0
fail           0
dtype: int64
In [5]:
# Distribution of numeric columns
fig = make_subplots(rows=5, cols=2, subplot_titles=data.columns)
for i, column in enumerate(data.columns):
    row = i // 2 + 1
    col = i % 2 + 1
    hist = px.histogram(data, x=column, template='plotly_dark', color_discrete_sequence=['#F63366'])
    hist.update_traces(marker_line_width=0.5, marker_line_color="white")
    fig.add_trace(hist.data[0], row=row, col=col)

fig.update_layout(height=1200, title_text="Distribution of Numeric Columns", title_font=dict(size=25), title_x=0.5, showlegend=False)
fig.show()
In [6]:
# Correlation Heatmap
corr = data.corr()
fig = ff.create_annotated_heatmap(
    z=corr.values,
    x=list(corr.columns),
    y=list(corr.index),
    annotation_text=corr.round(2).values,
    showscale=True,
    colorscale='Viridis')
fig.update_layout(title_text='Correlation Heatmap', title_font=dict(size=25), title_x=0.5)
fig.show()
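Beyond the heatmap, it can help to rank features by the strength of their linear relationship with the target. A minimal sketch of such a ranking, using a hypothetical miniature frame in place of the real `data` (the `df` values below are illustrative only):

```python
import pandas as pd

# Hypothetical miniature frame standing in for `data`; with the real
# dataset you would call data.corr()['fail'] directly.
df = pd.DataFrame({
    "AQ":   [1, 2, 3, 4, 5, 6],
    "VOC":  [0, 1, 1, 3, 5, 6],
    "USS":  [6, 5, 4, 3, 2, 1],
    "fail": [0, 0, 0, 1, 1, 1],
})

# Rank features by absolute correlation with the target.
ranking = (
    df.corr()["fail"]
      .drop("fail")
      .abs()
      .sort_values(ascending=False)
)
print(ranking)
```

On the full dataset, the same one-liner highlights which sensors move most strongly with `fail`.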
In [7]:
# Boxplots for each feature to identify outliers
fig = make_subplots(rows=5, cols=2, subplot_titles=data.columns[:-1])
for i, column in enumerate(data.columns[:-1]):  # Excluding the target column 'fail'
    row = i // 2 + 1
    col = i % 2 + 1
    box = px.box(data, y=column, template='plotly_dark', color_discrete_sequence=['#636EFA'])
    box.update_traces(marker_line_width=0.5, marker_line_color="white")
    fig.add_trace(box.data[0], row=row, col=col)

fig.update_layout(height=1200, title_text="Boxplots of Features", title_font=dict(size=25), title_x=0.5, showlegend=False)
fig.show()
In [8]:
# Scatter plots to visualize relationships between features and target
fig = make_subplots(rows=5, cols=2, subplot_titles=data.columns[:-1])
for i, column in enumerate(data.columns[:-1]):  # Excluding the target column 'fail'
    row = i // 2 + 1
    col = i % 2 + 1
    scatter = px.scatter(data, x=column, y='fail', template='plotly_dark', color='fail', color_continuous_scale='Viridis')
    scatter.update_traces(marker=dict(size=5, opacity=0.7, line=dict(width=0.5, color='white')))
    fig.add_trace(scatter.data[0], row=row, col=col)

fig.update_layout(height=1200, title_text="Scatter Plots of Features vs Fail", title_font=dict(size=25), title_x=0.5, showlegend=False)
fig.show()
In [9]:
# Data Preprocessing
X = data.drop(columns=['fail'])
y = data['fail']
In [10]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
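The `fail` rate is roughly 42% (see the `describe()` output above), so a plain random split is usually fine, but passing `stratify=y` guarantees that the class ratio is preserved in both splits. A small sketch on toy labels (the arrays here are stand-ins, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y; the real split above uses the sensor data.
X = np.arange(100).reshape(50, 2)
y = np.array([0] * 30 + [1] * 20)  # 40% positives, close to the fail rate

# stratify=y keeps the 40% positive ratio in both train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())  # both exactly 0.4
```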
In [11]:
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
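Note that the scaler is fit on the training split only and merely applied to the test split, which avoids leaking test-set statistics into training. A tiny illustration with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers: the scaler learns mean/std from the training data only.
X_train_demo = np.array([[0.0], [2.0], [4.0]])  # training mean is 2.0
X_test_demo = np.array([[2.0]])

scaler_demo = StandardScaler().fit(X_train_demo)
# The test point equals the training mean, so it maps to 0.
print(scaler_demo.transform(X_test_demo))  # [[0.]]
```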
In [12]:
# Model Training and Evaluation Function
def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    # Train the model
    if model_name == 'Neural Network':
        # Neural Network specific training
        model.fit(X_train, y_train, 
                  epochs=100, 
                  batch_size=32, 
                  validation_split=0.2, 
                  callbacks=[EarlyStopping(patience=10)],
                  verbose=0)
        y_prob = model.predict(X_test).flatten()
        y_pred = (y_prob > 0.5).astype(int)
    elif model_name == 'XGBoost':
        # XGBoost specific training with DMatrix
        dtrain = xgb.DMatrix(X_train, label=y_train)
        dtest = xgb.DMatrix(X_test, label=y_test)
        
        # XGBoost training parameters
        params = {
            'objective': 'binary:logistic',
            'eval_metric': 'logloss',
            'random_state': 42
        }
        
        # Use watchlist for early stopping
        watchlist = [(dtrain, 'train'), (dtest, 'eval')]
        
        model = xgb.train(
            params, 
            dtrain, 
            num_boost_round=100,  # max number of boosting iterations
            evals=watchlist, 
            early_stopping_rounds=10,
            verbose_eval=False
        )
        
        y_prob = model.predict(dtest)
        y_pred = (y_prob > 0.5).astype(int)
    else:
        # Scikit-learn models
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]
    
    # Evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)
    cm = confusion_matrix(y_test, y_pred)
    
    # ROC Curve
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)
    
    return {
        'model': model,
        'accuracy': accuracy,
        'report': report,
        'confusion_matrix': cm,
        'fpr': fpr,
        'tpr': tpr,
        'roc_auc': roc_auc
    }
In [13]:
# Initialize models
models = {
    'Neural Network': Sequential([
        tf.keras.layers.Input(shape=(X_train.shape[1],)),
        Dense(64, activation='relu'),
        Dropout(0.3),
        Dense(32, activation='relu'),
        Dropout(0.2),
        Dense(1, activation='sigmoid')
    ]),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    # Note: evaluate_model trains XGBoost via xgb.train with early stopping,
    # so this classifier instance is only a placeholder in the models dict.
    'XGBoost': xgb.XGBClassifier(random_state=42)
}

# Compile Neural Network
models['Neural Network'].compile(
    optimizer=Adam(learning_rate=0.001),
    loss='binary_crossentropy', 
    metrics=['accuracy']
)
In [14]:
# Hyperparameter grids for GridSearchCV.
# Note: the XGBoost grid is defined but never used below, because XGBoost is
# trained directly via xgb.train with early stopping inside evaluate_model.
params_grid = {
    'Decision Tree': {'max_depth': [5, 10, 15], 'min_samples_split': [2, 5, 10]},
    'Random Forest': {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]},
    'XGBoost': {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 1]}
}
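The evaluation loop below applies these grids via GridSearchCV for the scikit-learn models. As a self-contained illustration of the mechanics, here is the Decision Tree grid run on synthetic data (`make_classification` stands in for the scaled sensor features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled training data.
X_toy, y_toy = make_classification(n_samples=200, n_features=9, random_state=42)

# Exhaustively cross-validate every combination in the grid.
grid_demo = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {'max_depth': [5, 10, 15], 'min_samples_split': [2, 5, 10]},
    cv=5,
    n_jobs=-1,
)
grid_demo.fit(X_toy, y_toy)
print(grid_demo.best_params_, round(grid_demo.best_score_, 3))
```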
In [15]:
# Store results
results = {}
In [16]:
# Evaluate models
for name, model in models.items():
    print(f"\nEvaluating {name}")

    if name in ('Neural Network', 'XGBoost'):
        # These models handle their own training (and early stopping)
        # inside evaluate_model, so GridSearchCV is skipped.
        results[name] = evaluate_model(model, X_train_scaled, X_test_scaled, y_train, y_test, name)
    else:
        # Scikit-learn models tuned with GridSearchCV
        grid = GridSearchCV(model, params_grid[name], cv=5, n_jobs=-1)
        grid.fit(X_train_scaled, y_train)
        best_model = grid.best_estimator_
        results[name] = evaluate_model(best_model, X_train_scaled, X_test_scaled, y_train, y_test, name)
        print(f"Best parameters for {name}: {grid.best_params_}")
Evaluating Neural Network
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step 
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step

Evaluating Decision Tree
Best parameters for Decision Tree: {'max_depth': 5, 'min_samples_split': 10}

Evaluating Random Forest
Best parameters for Random Forest: {'max_depth': None, 'n_estimators': 100}

Evaluating XGBoost
In [17]:
# Print detailed results
for name, result in results.items():
    print(f"\n{name} Results:")
    print(f"Accuracy: {result['accuracy']}")
    print("Classification Report:")
    print(pd.DataFrame(result['report']).transpose())
Neural Network Results:
Accuracy: 0.873015873015873
Classification Report:
              precision    recall  f1-score     support
0              0.882353  0.882353  0.882353  102.000000
1              0.862069  0.862069  0.862069   87.000000
accuracy       0.873016  0.873016  0.873016    0.873016
macro avg      0.872211  0.872211  0.872211  189.000000
weighted avg   0.873016  0.873016  0.873016  189.000000

Decision Tree Results:
Accuracy: 0.8624338624338624
Classification Report:
              precision    recall  f1-score     support
0              0.880000  0.862745  0.871287  102.000000
1              0.842697  0.862069  0.852273   87.000000
accuracy       0.862434  0.862434  0.862434    0.862434
macro avg      0.861348  0.862407  0.861780  189.000000
weighted avg   0.862829  0.862434  0.862534  189.000000

Random Forest Results:
Accuracy: 0.8783068783068783
Classification Report:
              precision    recall  f1-score     support
0              0.891089  0.882353  0.886700  102.000000
1              0.863636  0.873563  0.868571   87.000000
accuracy       0.878307  0.878307  0.878307    0.878307
macro avg      0.877363  0.877958  0.877635  189.000000
weighted avg   0.878452  0.878307  0.878355  189.000000

XGBoost Results:
Accuracy: 0.8465608465608465
Classification Report:
              precision    recall  f1-score     support
0              0.868687  0.843137  0.855721  102.000000
1              0.822222  0.850575  0.836158   87.000000
accuracy       0.846561  0.846561  0.846561    0.846561
macro avg      0.845455  0.846856  0.845940  189.000000
weighted avg   0.847298  0.846561  0.846716  189.000000
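The per-model reports above are easier to compare side by side. A short sketch that tabulates the accuracies printed above (with the live `results` dict you would instead build the mapping as `{name: r['accuracy'] for name, r in results.items()}`):

```python
import pandas as pd

# Accuracies as printed in the reports above.
accuracies = {
    "Neural Network": 0.873016,
    "Decision Tree": 0.862434,
    "Random Forest": 0.878307,
    "XGBoost": 0.846561,
}

# One sorted table instead of four separate reports.
summary = pd.Series(accuracies, name="accuracy").sort_values(ascending=False)
print(summary)
```

On this split, Random Forest edges out the other models on accuracy, with XGBoost trailing.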
In [18]:
# Plotting ROC Curves
fig = go.Figure()
for name, result in results.items():
    fig.add_trace(go.Scatter(
        x=result['fpr'], 
        y=result['tpr'], 
        mode='lines', 
        name=f'{name} (AUC = {result["roc_auc"]:.2f})', 
        line=dict(width=2)
    ))

fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode='lines', line=dict(dash='dash', color='gray'), name='Random'))
fig.update_layout(
    title_text='Receiver Operating Characteristic (ROC) Curve', 
    title_font=dict(size=25), 
    xaxis_title='False Positive Rate', 
    yaxis_title='True Positive Rate', 
    template='plotly_dark'
)
fig.show()

Receiver Operating Characteristic (ROC) Curve

Interpretation

The ROC curves displayed above show the performance of all four classifiers on the test dataset. Each curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.

Key Points:

  • True Positive Rate (TPR): Also known as Sensitivity or Recall, it is the ratio of correctly predicted positive observations to the actual positives.
  • False Positive Rate (FPR): It is the ratio of incorrectly predicted positive observations to the actual negatives.

Analysis:

  • A perfect classifier would have an AUC of 1.0, while a classifier with no discriminative power would have an AUC of 0.5 (represented by the dashed line labeled "Random").
  • The ROC curves sit close to the top-left corner, showing that the models achieve a high TPR at a low FPR: they correctly identify a large proportion of failures while keeping false positives to a minimum.
In [ ]: